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Abstract. We consider the problem of publishing location datasets, in 
particular 2D spatial pointsets, in a differentially private manner. Many 
existing mechanisms focus on frequency counts of the points in some 
a priori partition of the domain that is difficult to determine. We pro- 
pose an approach that adds noise directly to the point, or to a group 
of neighboring points. Our approach is based on the observation that, 
the sensitivity of sorting, as a function on sets of real numbers, can be 
bounded. Together with isotonic regression, the dataset can be accu- 
rately reconstructed. To extend the mechanism to higher dimension, we 
employ locality preserving function to map the dataset to a bounded 
interval. Although there are fundamental limits on the performance of 
locality preserving functions, fortunately, our problem only requires dis- 
tance preservation in the "easier" direction, and the well-known Hilbert 
space-filling curve suffices to provide high accuracy. The publishing pro- 
cess is simple from the publisher's point of view: the publisher just needs 
to map the data, sort them, group them, add Laplace noise and publish 
the dataset. The only parameter to determine is the group size which 
can be chosen based on predicted generalization errors. Empirical study 
shows that the published dataset can also exploited to answer other 
queries, for example, range query and median query, accurately. 



1 Introduction 

The popularity of personal devices equipped with location sensors leads to large 
amount of location data being gathered. Such data contain rich information and 
would be valuable if they can be shared and published. As the data may reveal 
location of identified individual, it is important to anonymize the data before 
publishing. The recently developed notion of differential privacy [5] provides a 
strong form of privacy assurance regardless of the background information held 
by the adversaries. Such assurance is important as many case studies and past 
events have shown that a seemingly annoymized dataset together with additional 
knowledge held by the adversary could reveal information on individuals. 

Most studies on differential privacy focus on publishing statistical values, for 
instance, k-means|2], private coreset[7j, and median of the database[19j. Publish- 
ing specific statistics or data-mining results is meaningful if the publisher knows 



what the public specifically want. However, there are situations where the pub- 
lishers want to give the public greater flexibility in analyzing and exploring the 
data, for example, using different visualization techniques. In such scenarios, it 
is desired to "publish data, not the data mining result" [S]. 
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Fig. 1. Twitter location data cropped at the North America region. To avoid 
clogging, only 10% of the points (randomly chosen) are plotted. 



In this paper, we consider the problem of publishing location data, or other 
low dimensional data in a differential private manner. An example is shown in 
Fig. [I] which depicts the locations of 183,072 Twitter users in North America pQ, 
and Fig. [2] shows a sequence of sorted real number obtained by mapping the 
points in Fig. [T] into the unit interval. We proposed a mechanism based on the 
observation that sorting, as a function that takes in a set of real numbers from the 
unit interval, interestingly has sensitivity one (Theorem [I]) . Hence e-differential 
privacy can be achieved by adding Laplace noise with a scale parameter 1/e 
directly to the sorted sequence. Fig. [3] shows such noisy data by adding noise to 
the curve in Fig. [2j Although seemingly noisy, as the original sequence is sorted, 
there are dependencies among them to be exploited. Fig.[4]shows a reconstructed 
sequence using isotonic regression. 

To further reduce perturbation induced by the Laplace noise, consecutive 
elements in the sorted sequence can be grouped. However, grouping introduces 
generalization error. The amount of generalization error in the "worst case" can 
be analytically determined, and together with the model of error induced by the 
Laplace noise, the publisher can choose an appropriate group size k based on 
the privacy requirement e and the total number of points n. For the example in 
Fig. 1, the group size determined is 300 and the corresponding published and 
reconstructed data are depicted in Fig. [5j Fig. [6] shows a comparison of the error 
of each points in the reconstructed sequence. After reconstruction, the inverse 
mapping is to be applied to the data. Our variant of isotonic regression outputs 
dataset with larger number of repetition, as shown in Fig. [7| Fig. [H] shows the 
reconstructed pointset, with a post-processing that maps the repeating points 



to the surrounding area of the location. The post-processing is solely for the 
purpose of visualization. Fig. [9] shows a zoomed version region around New York 
City. 




Fig. 2. The ID data mapped from 2D Twitter locations (Fig. [T]) to the unit 
interval [0, 1]. 



Fig. 3. The published noisy location data with group size k = 1 and e = 1, 
that is, there is no grouping. To avoid clogging, only 2% of the published points 
(randomly chosen) are plotted. 



Our choice of the group size k is determined by minimizing an error func- 
tion which measures the Earth-Mover-Distance (EMD) of the original and re- 
constructed pointsets. In one-dimension, the EMD of two equal-sized pointsets 
is simply the L\ distance between the two respective sorted sequences. Al- 
though designed to minimize EMD, the proposed mechanism achieves good ac- 
curacy w.r.t. other utilities. Experimental studies show that the proposed mecha- 
nism achieves higher accuracy compared to the wavelet-based method for range 
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Fig. 4. The reconstructed data obtained by performing isotonic regression on 
the published data as shown in Fig. [3j plotted together with the original data. 
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Fig. 5. The published noisy location data with grouping (group size k = 300) 
and the corresponding reconstructed data through isotonic regression. The figure 
shows only the region from 0.5 to 0.7 in the unit interval. 



aueriesffi)]. and outperforms the equi-width histogram w.r.t. the accuracy in 
estimation the underlying probability density function. 

An advantage of the proposed mechanism is its simplicity from the publisher's 
viewpoint. The publisher only has to map the points to the unit interval, sort 
them, add Laplace noise, and publish the results. By publishing the "raw" noisy 
data instead of the reconstructed data, users in the public are not confined to a 
particular inference techniques, and have the flexibility in using different variants 
of isotonic regression to suit their needs. 

In contrast to an equi-width histogram, the bins of an equi-depth histogram 
contains same number of elements, with their width varies. Intuitively, the size 
of a bin is larger in location with lower "density" of points. There are exten- 
sive studies on equi-depth histogram, and it is generally well accepted that an 
equi-depth histogram provides more useful statistical information [2D] compare 
to equi-width histogram. However, it is not clear how to generate an equi-depth 
histogram while achieving differential privacy. Interestingly, grouping in our pro- 




Fig. 6. Differences of the two reconstructed data from the original. The black 
dashed line is the displacement of reconstructed data without grouping (see Fig. 
[4j , the blue solid line is the displacement of reconstructed data with group size 
k = 300 (see Fig. [5). 
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Fig. 7. Inverse mapped of the reconstructed data with e = 5. Each point in the 
figure represents a group of repeating points. 



posed mechanism naturally produces equi-depth histograms: grouping of k ele- 
ments leads to a depth of k. 

Our approach can be applied to obtain order-statistic, for example, median. 
Finding median is challenging due to its large sensitivity. Accurate mechanism 
can be derived by adding Laplace noise proportional to the smooth sensitivity fT9"] 
instead of the global sensitivity. However, computation of smooth sensitivity 
takes (9(n 2 ) time where n is the dataset's size. In contrast, our mechanism takes 
0(n) time when the dataset is already sorted. Experimental studies on datasets 
with 129 element^] suggest that the proposed mechanism is less sensitive to a 
higher local sensitivity, or a small e. 



1 As it is computational intensive to compute smooth sensitivity, we are unable to 
repeat the experiments for significant larger n. 
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Fig. 8. Inverse mapped of the reconstructed data followed by post-processing. 
The post-processing "diffuses" a repeated point to its surrounding and is per- 
formed solely for visualization purpose. To avoid clogging, only 10% of the points 
(randomly chosen) are plotted. 
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Fig. 9. A zoom-in of Fig. [8] to region within the indicated rectangle. 



The locality preserving map is a key component in our mechanism, taking the 
role of transforming the data points to the one-dimension space. Although there 
are fundamental limits on locality preserving mapping, fortunately, our problem 
only requires preservation in the "easier" direction, i.e., any pair of neighbors 
in the one-dimensional domain are also neighbors in the multi-dimensional do- 
main. The classic Hilbert space-filling curve suffices to provide high accuracy. 
For other types of non-spatial data, our techniques can be applied as long as an 
appropriate locality preserving mapping is available. 



Organization: We first describe some background materials in the next section 
(Section [2]). In Section [3] we present our main ideas and mechanism, and show 
that the proposed mechanism achieves differential privacy in Section [4] Next, in 
section [5j we formulate and analyze the noise incurred by the Laplace and the 
generalization noise. Based on the noise model, we derive a strategy to choose 
the group size. In Section [6j we compare our mechanism with three known mech- 



anisms: (1) equi-width histogram, (2) wavelet-based method for range queries, 
and (3) smooth-sensitivity based median finding. In Section [7j we describe a few 
possible extensions, in particular, a hybrid of our mechanism with equi-width 
histogram. Lastly, we describe related works in Section [8] and conclude in Section 

m 

2 Background 

2.1 Differential Privacy and Laplace Noise 

We treat a database as a multi-set (i.e. a set with possibly repeating elements), 
and define two D-y and D 2 to be neighbor when D 2 can be obtained from D\ by 
replacing one element, i.e. D\ = {x} U D 2 \ {y} for some a; and y. Let us call the 
above definition of neighborhood the replacement neighborhood. 

A randomized algorithm (also known as a mechanism) A achieves e differen- 
tial privacy if, 

Pr[A(Di) eS]< exp{e) x Pr[A{D 2 ) € S] 

for all S C Range(_4), where Range(.A) denotes the output range of the algorithm 
A, and for any pair of neighbouring datasets D\ and D 2 . 

The replacement neighborhood we adopted is similar to the notion used by 
Nissim et al |19| . Such variant differs from the well-adopted notion that treats 
two datasets D\, D 2 to be neighbors iff D 2 = D\U{x} or D 2 = D\\{x} for some 
x. Note that a mechanism that achieves differential privacy under replacement 
neighborhood can be converted to one that achieves privacy under the well- 
adopted neighborhood. 

For a function / : D — > K fc , the sensitivity [5] of / is defined as 

A(f) :=m a ,x\\f(D 1 )-f(D 2 )\\ 1 

where the maximum is taken over all pairs of neighboring D\ and D 2 . It can be 
shown that the mechanism A 

A(D) = f(D) + (Lap(A(f)/e)) k 

achieves e-differential privacy, where (Lap(A(f) / e)) k is a vector of k indepen- 
dently and randomly chosen values from the Laplace distribution with standard 
deviation 2Z\(/)/e. 

It is meaningless if the output of a mechanism is simply noise, even if privacy 
is achieved. The accuracy of a mechanism is measured by a utility function 
u(X,y) that measures the quality of the output y given the dataset is X. Al- 
ternatively, the utility can be measured by an error function that measures the 
distant of the output from the ideal output. 

The notion of differential privacy has a useful sequential composition property 
[IB] : if mechanisms A^i and A4 2 achieve t\ and e2-differential privacy respec- 
tively, then the combined mechanism of applying Mi follows by M 2 achieves 
£1+62 differential privacy. 



2.2 Isotonic Regression 



Given a sequence of n real numbers oi, . . . , a n , the problem of finding the least- 
square fit Xx,...,x n subjected to the constraints Xi < Xj for all i < j < n is 
known as the isotonic regression. Formally, we wants to find the x±, . . . , x n that 
minimizes 

n 

^^(xj — a,i) 2 , subjected to Xi < Xj for all 1 < i < j < n 
»=i 

The unique solution can be efficiently found using pool-adjacent-violators al- 
gorithms in 0(n) time [10] . When minimizing w.r.t. i-\ norm, there is also an 
efficient O(nlogn) algorithm|23j. There are many variants of isotonic regression, 
for example, having a smoothness component in the objective function |25|17j . 

2.3 Locality Preserving Map 

A locality preserving map T : M. d — > R maps d-dimensional points to real num- 
bers while preserving "locality" . In this paper, we seek mapping whereby two 
neighboring points in the one-dimensional range are also neighboring points in 
the (i-dimensional domain. Specifically, there is some constant A s.t. for any 

x, y <E M. d , 

\\x-y\\ 2 <A-(T(x)-T(y)) 1 / d 

The well-known Hilbert curve achieves \\x — < 3yJ\T(x) — T(y)\ — 2 for two 
sufficiently far-aparted points x, y in M 2 [S]. Niedermeier et al. [TH] showed that 
with careful construction, the bound can be improved to t/4|T(x) - T(y)\ - 2. 
In our construction, for simplicity, we use Hilbert curve. 

Note that it is challenging in preserving locality in the other direction, that 
is, any two neighboring points in the d-dimensional domain are also neighboring 
points in the one-dimensional range. Fortunately, in our problem, such property 
is not required. 

2.4 Datasets 

We conduct experiments on two datasets: locations of Twitter users [1] (herein 
called the Twitter location dataset) and the dataset collected by Kaluza et 
al. [T3] (herein called Kaluza's dataset). The Twitter location dataset contains 
over 1 million Twitter users' data from the period of March 2006 to March 2010, 
among which around 200,000 tuples are labeled with location (represented in 
latitude and longitude) and most of the tuples are in the North America conti- 
nent, concentrating in regions around the state of New York and California. Fig. 
[T] shows the cropped region covering most of the North America continent. The 
cropped region contains 183,072 tuples. The Kaluza's dataset contains 164,860 
tuples collected from tags that continuously records the locations information of 
5 individuals. 



3 Proposed Approach 



Given the privacy requirement e and a dataset D of size n, the publisher carries 
out the following: 

Al. Maps each point in I? to a real number in the unit interval [0,1] using 
a locality preserving map T. Let T(D) be the set of transformed points. 
Determine a group size k based on n and e. For clarity in exposition, let us 
assume that k divides n. 

A2. Sorts T(D). Divides the sorted sequence into groups of k consecutive ele- 
ments. For each group, determines its sum. Let the sums be S = (si, . . . , s n /k)- 

A3. Publishes S = S + (Lap^" 1 ))^) and the group size fc. 

An user in the public may extract information from the published data as 
follow: 

Bl. Performs isotonic regression on k~ 1 S, and maps the data point back to their 
original domain. That is, computes D = T -1 (TRQe -1 S)) , where IR(-) de- 
notes isotonic regression. Let us call D the reconstructed data. 



Remark. 

1. The size of the dataset n is not considered to be a secret and can be derived 
from the published S. The transformation T and the lookup table for k are 
public knowledge prior to the publishing. 

2. When the database size n is unknown to the user, the publisher can exploit 
the sequential composition property of differential privacy and carry out the 
following steps: (1) Firstly, publishes a noisy size n using a portion of the 
privacy "budget". (2) Next, extracts exactly n points from the dataset using 
a deterministic padding algorithm as follow: if n > n, inserts (n — n) number 
of 0's to the dataset; if n < n, removes (n — n) smallest elements. (3) Lastly, 
publishes the padded pointset using our proposed mechanism. 

3. To relieve the public users from computing step Bl, the regression can be 
carried out by the publisher on behalf of the users. Nevertheless, the raw 
data S should be (but not necessary) published alongside the reconstructed 
data. 

4. The public is not confined to adopt a particular isotonic regression. After 
S is published, various inference techniques can be applied. For instance, a 
user may perform a variant of isotonic regression that optimizes objective 
functions with a smoothness component [25117]. 

5. The publisher's main design decisions are the choice of T and the group 
size k. The choice of T depends on the underlying metric of the points. 
For Euclidean distance in two-dimensional space, the classic Hilbert curve 
already attains good performance. The group size k can be computed from 
the lookup table constructed using our proposed noise model. 



4 Security Analysis: Sensitivity of Sorting is bounded 



In this section, we show that the proposed mechanism (Step Al to A3) achieves 
differential privacy, and thus also the reconstructed pointset output by Bl. The 
following theorem shows that sorting, as a function, interestingly has sensitivity 
1. Note that a straightforward analysis that treats each element independently 
could lead to a bound of n, which is too large to be useful. 

Theorem 1. Let S n (D) be a function that on input D, which is a multiset con- 
taining n real numbers from the unit interval [0, 1], outputs the sorted sequence 
of elements in D. The sensitivity of S n w.r.t. the replacement neighborhood is 1. 

Proof. Let Let D\ and D2 be any two neighbouring datasets. {x\, x-i . . . Xi . . . x n ) 
be S n (Di), i.e. the sorted sequence of D\. WLOG, let us assume that an element 
Xi is replaced by a larger value A to give D 2 , for some 1 < i < n— 1 and Xi < A. 
Let j to be largest index s.t. Xj < A < 1. Hence, the sorted sequence of D 2 is: 

Xl : X 2: • • ■ , Xi— I , X{-\-l , . . . , Xj , A , Xj-^i , . . . , x n 

The Li difference due to the replacement is, 

||S n (Di)-S n (D 2 )||i 
= \x i+1 - Xi\ + \x i+2 - x i+1 \ + ... 

+\xj-Xj-i\ + \A- Xj \ 
= (x i+ i - Xi) + (x l+2 - X i+ i) + ... 

+(x j -Xj- 1 ) + (A-x j ) 

= A - Xi < 1 

We can easily find an instance of D\ and D 2 where the difference A — Xi = 1. 
Hence, the sensitivity is 1. 

In the proof, the fact that the sequence is sorted is exploited to obtain the 
bound. Since the sensitivity is 1, the mechanism S n (D) + Lap(l/e) n enjoys e- 
diffcrential privacy. Also note that the value of n is fixed. Hence, in the context 
of data publishing, the size of D is not a secret and is made known to the public. 

Next, we show that grouping (in Step A2) has no effect on the sensitivity. 

Corollary 1. Consider a partition H — {hi, hi ■ ■ ■ h m } of the indices {1, 2, ... , n}. 

Let Sh{D) be the function that, on input D, which is a multiset containing n 
real numbers from the unit interval [0, 1] , outputs a sequence of m numbers: 

Hi = x jj 

for 1 < i < m where (x\,X2, • ■ • , x n ) is the sorted sequence of D. The sensitivity 
of Sh w.r.t. the replacement neigbourhood is 1. 



Proof. Let us consider two neighbouring datasets D\ and Z?2, and their respec- 
tive sorted sequences be 

1 ) j • • • ) <£n 



and 



15 2 5 * * " ) T, 



WLOG, let us assume that -D2 is obtained by replacing an element in D\ with 
a strictly larger element. Thus, X{ < x' i for all i's. 
The Li difference due to the replacement is 



|£k(£>i)-Sir(D 2 )||i 



E 



E x » E Xi 
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Note that the grouping in step A2 is a special partition with equal-sized 
ft-i's . Hence, Corollary Ql gives a more general result where H can be any par- 
tition. From Corollary 111 the proposed mechanism that publishes S achieves 
e-diffcrcntial privacy. 



5 Analysis and Parameter Determination 

The main goal of this section is to analyze the effect of the privacy requirement 
e, dataset size n and the group size k on inducing the error in the reconstructed 
data, which in turn provides a strategy in choosing the parameter k from the 
given n and e. 

Intuitively, in the absent of "generalization noise" , when n is larger, there 
are more constraints in the isotonic regression, leading to a more accurate recon- 
struction. Grouping affects the accuracy in two opposing ways. It reduces the 
number of constraints for regression, and introduces generalization error. On the 
other hand, the Laplace noise is essentially reduced by a factor of k. By taking 
into account of the above factor, and the effect of generalization noise, we can 
determine the optimal k. 



5.1 Error function and Utility 



We use an error function related to the Earth-Mover-Distance (EMD) [5T] to 
quality the utility of the published data. The EMD between two pointsets of 



equal size is defined to be the minimum cost of bipartite matching between the 
two sets, where the cost of an edge linking two points is the cost of moving 
one point to the other. Hence, EMD can be viewed as the minimum cost of 
transforming one pointset to the other. Different variants of EMD differ on how 
the cost is defined. In this paper, we adopt the typical definition that defines the 
cost as the Euclidean distant between the two points. 

In one-dimensional space, the EMD between two sets D and D is simply 
the L\ norm of the differences between the two respective sorted sequences, i.e. 
\\S n (D) — S n (D)\\i, which can be efficiently computed. In other words, 

n 

EMD(D,-D)=X>i-pi| (1) 

i=l 

where piS and p^s are the sorted sequence of D and D respectively. 

Given a pointset D and the published pointset D of a mechanism M. where 
|D| = \D\ = n, let us define the normalized error as ^-EMD(Z), D) and denote 
Err_M,v the expected normalized error, 



Err MD = Exp 



- EMD(D,D) 
n 



(2) 



where the expectation is taken over the randomness in the mechanism. 

Although EMD can be computed efficiently for one-dimensional pointsets, 
the best known algorithm that computes EMD in higher dimension has cubic 
running time j!4j . Jang et al. |12j proposed a fast approximation that employs a 
space-filling curve. Similarly, for higher dimensional space, we approximate the 
EMD by first map each point to a real number in [0, 1] through a space-filling 
curve, and then compute the EMD in the one-dimensional space in 0(n log n) 
time. 



5.2 Error incurred from Laplace Noise 

Let us first omit the effect of grouping and consider cases where k = 1. We 
conduct experimental studies on four types of pointsets with varying size n: (1) 
Multisets containing elements with the same value 0.5 (herein called "repeating 
single-value dataset), (2) sets containing equally-spaced numbers (i/(n — 1)) 
for i = 0, ...,n— 1 (herein call equally-spaced dataset), (3) sets containing 
n randomly chosen elements from the Twitter location data [JJ, and (4) sets 
containing n randomly chosen elements from the Kaluza's data |13j . 

Fig. [lO] shows the expected normalized error. Each value on the graph is the 
average over 500 sample runs. Not surprisingly, the expected error reduces when 
the number of points increases. Fig. [TJJ shows the expected normalized error 
for dataset on equally-spaced points for different e. The results agree with the 
intuition that when e is increased by a factor of c, the error would approximately 
decrease by factor of c as shown in Fig. [l2j 




Fig. 10. The expected normalized error without grouping versus the size of the 
dataset. The red solid line is for repeating single- value dataset, the black dashed 
line is for equally-spaced numbers, the purple dotted line is for the Kaluza's 
dataset and the blue dash-dot line is for the Twitter location dataset. 
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Fig. 11. The expected normalized error without grouping versus the size of 
dataset for different the security parameter e. 



5.3 Effect of Grouping on Laplace Noise 

Now, we consider cases where k > 1. Grouping reduces the number of con- 
strains by a factor of k. As suggested by Fig. [lOj when the number of datapoint 
decreases, error increases. 

On the other hand, recall that the regression is performed on the published 
values divided by k (see the role of k in Step Bl). This essentially reduces the 
level of Laplace noise by a factor of k. Hence, the accuracy attained by grouping 
k elements is "equivalent" to the accuracy attained without grouping but with 
the privacy parameter e increased by a factor of k. 



From Fig. 10 we can predict the effects of grouping on the repeating single- 
value dataset. For instance, if n = 10, 000 and e = 1 and k = 5, without grouping, 
the reconstructed points are expected to have a 0.02 error; whereas with grouping 
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Fig. 12. The ratio of the expected normalized error with different e against the 
expected normalized error with e = 1. 



of size 5, the expected error is 0.05/5 = 0.01. Fig. 13 shows the predicted errors 
under different fe's, for n = 10, 000 and e = 1 of different datasets. 
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Fig. 13. Predicted error versus different group size without generalization noise 
for different datasets of size n — 10,000 and e = 1. The red solid line is the 
upper bound, the black dashed line, purple dotted line and blue dash-dot line 
are for equally-spaced numbers dataset, Kaluza's dataset and Twitter location 
dataset respectively. 



5.4 Effects of Grouping on Generalization Noise 

The negative effect of grouping is the generalization noise, as all elements in a 
group is represented by their mean. Before giving formal description of general- 
ization noise, let us introduce some notations. 



Given a sequence D = (xi, . . . , x n ) of n numbers, and a parameter k, where 
k divides n, let us call the following function downs ampling: 



ik (D) = ( Sl , 



9 (n/k)} 



where each Sj is the average of affc(<-i)+i! . . . , x^. Given a sequence D' 
and k, let us call the following function upsampling, 



' S m) 



where 



'L(i-i)/fcj+i 



tk (D') = {x[,...,x mk ) 
for each i. 



The normalized generalization error is defined as, 

Gen D . k = -\\D-n (ik (D))\\i 
n 

It is easy to see that, for any k and D 7 the normalized generalization error 



is at most k/(2n). Fig. 14 shows the generalization error of different group size 



a dataset containing 10, 000 equally-spaced values, a dataset containing 10, 000 
numbers randomly drawn from the transformed Kaluza's dataset and a dataset 
of 10, 000 numbers randomly drawn from the transformed Twitter location data. 
They agrees with our upper bound on the generalization error. 

Furthermore, the worst case occurred when the values in the groups divided 
equally between two values, for example, half of them have value 0, and half 
of them have value 1. This is very unlikely. Intuitively, even if the elements in 
the groups only have two distinct value a and b, the number of elements having 
value a may vary. Hence, one would expect the average generalization error to be 
k/[An). Fig. 14 shows that such approximation is very accurate and consistent 



for various datasets. 
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Fig. 14. Generalization error of different group size for different datasets of size 
n = 10,000 and e = 1. The red solid line is for the Kaluza's dataset, the blue 
dotted line is for the equally-spaced dataset and the black dashed line for the 
Twitter location dataset. 



5.5 Combined effects of grouping 



Now, let us combine the effects of both grouping and Laplace noises on the 
normalized error Errr>. Let us consider the mechanism that, on input D and the 
parameter k, outputs 

M k (D) =t fc (IR(| fc + Lap(l)"/ fc )) 

This mechanism is essentially similar to our proposed method, but with the 
difference on how k is chosen: here, the k is given as a parameter, whereas in 
Step Al of the proposed method, the k is chosen from a lookup table. Recall 
that the expected normalized error produced by this mechanism Ai k on D is 
denoted as Errj^ k ^. For abbreviation, we write it as 

Let S to be an instance of Ik (S n (D)) + Lap(l)"/ fc , and D the corresponding 
reconstructed dataset generated by M k , i-e. D =tfc (IR(5)). We have, 

EMD (D,D) = \\S n (D)- tfc (IR(£))||i 
= ||5 w (D)-tfc (Ik (Sn(D)) 

+ n a (s n (D)))-n (ir(s))||i _ 

< n ■ Gen D .k + || tfc (h (S n (D)))- % (IR(5))||i 
= n • Gen D , k + k-\\h (S n (D)) - IR(S)||i 

= n • Gen D , fc + k ■ EMD(| fc (5„(£>)),IR(S)) (3) 

Note that the first term n • Geno,k is a constant independent of the random 
choices made by the mechanism. Also note that the second term is the EMD 
between the down-sampled dataset and its reconstructed copy obtained using 
group size 1. By taking expectation over randomness of the mechanism (i.e. the 
Laplace noise Lap(l)"/ fe ), we have 

Errk,D < Gen k ,D + Err 1Ak{D) (4) 

In other words, the expected normalized error is bounded by the sum of normal- 
ized generalization error, and the normalized error incurred by the Laplace noise. 
Figure [15] shows the three values versus different group size k for equally-spaced 
data of size 10,000. The minimum of the expected normalized error suggests the 
optimal group size k. 

Fig. [16] illustrates the expected errors for different k on the Twitter location 
data with 10,000 points. The red dotted line is Err k ,D whereas the blue solid line 
is the sum in the right-hand-side of the inequality Q . Note that the differences 
between the two graphs are small. We have conducted experiments on other 
datasets and observed similar small differences. Hence, we take the sum as an 
approximation to the expected normalized error, 



Errk,D ~ Gen k ,D + Errt^o) 



(5) 
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Fig. 15. The normalized error derived from the generalization error and per- 
turbation error of different group size k for database of size n = 10, 000 and 
e = 1. 



5.6 Determining the group size k 



Now, we are ready to find the optimal k given e and n. From Fig. 10 and Fig. 14 
and the approximation given in equation we can determine the best group 
size k give the size of the database n and the security requirement e. From 
e and Fig. 10 we can obtain the value Erri^m^/e for different k. From the 



database's size n and Fig. 14 we can approximate Genu,D for different n. Thus, 
we can approximate the normalized error Errk.D with equation ^ as illustrated 
in Fig. [l5j Using the same approach, the best group size given different n and e 
can be calculated and is presented in table [2] 




Fig. 16. The predicted error for 10,000 points with e = 1 and the actual error of 
dataset contains 10,000 points randomly selected from Twitter location dataset. 



Table 1. The best group size k given n and e 





e = 0.5 


e = 1 


e = 2 


e = 3 


n= 2,000 


44 


29 


20 


12 


n= 5,000 


59 


37 


27 


18 


n= 10,000 


79 


51 


36 


27 


n= 20,000 


121 


83 


61 


41 


ii= 100,000 


234 


150 


98 


73 



5.7 Effect of Isotonic Regression 

Isotonic regression is not unbiased. Reconstructed points on the left side (i.e. 
having value smaller than the median) tend to have negative bias, whereas points 
on the right side (i.e. having value larger than the median) tend to have positive 
bias. The biasness usually is smaller for points nearer to the median or the two 
ends. 

We conducted experiments on an equally-spaced data of size 1000 to mea- 
sure the displacement (the difference between the reconstructed points and the 



original point). Fig. 17 shows the estimated distribution of the displacement of 
a smaller point (the 100th point) and a larger point (the 900th point) derived 
from 200,000 runs of the experiments. The displacement of the smaller point 
has a mean of —0.0163 and the larger point has a mean of 0.0162, both with a 
variance of 0.0138. 
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Fig. 17. The displacement distribution of the 100th point and the 900th point 
in the reconstructed sequence of our process on an equally-spaced pointset of 
size 1000. 



6 Comparisons 

In this section, we compare the performance of the proposed mechanism with 
three mechanisms w.r.t. different utility functions. The first mechanism outputs 



equi-width histograms differential privately. We treat the generated cqui-width 
histogram as an estimate of the underlying probability density function, and 
use the statistical distance between density functions as a measure of utility. 
Next, we investigate the wavelet-based mechanism proposed by Xiao et al. [2"6] 
and measure accuracy of range queries. Lastly, we consider the problem of es- 
timating median, and compare with a mechanism based on smooth sensitivity 
proposed by Nissim et al[19j. We remark that although comparisons are based 
on different utility functions, our proposed mechanism is the same, in particular, 
the parameter k is chosen from the same lookup table. 

6.1 Equi-width Histogram 

The equi-width histogram compresses of equal-sized non-overlapping bins. With 
respect to the replacement neighborhood, the histogram generation has sensi- 
tivity of 2 and thus adding the Laplace noise Lap(2/e) to each of the frequency 
counts will give a differentially private histogram. Note that the size of the bins 
has to be determined prior to publishing. Without good background knowledge 
of the pointset, it is difficult to determine a good choice of bin size, as the same 
bin size can lead to significantly different accuracy for different pointsets. 

Estimating the underlying pdf: A histogram of a pointset can be treated as an 
estimate of an underlying density function / whereby the points are drawn from. 
The value of f(x) can be estimated by the ratio of the number of samples, over 
the width of the bin where x belongs to, multiplies by some normalizing constant. 
Essentially, this estimation employs step function as its kernel in estimating the 
density function. 

In this section, we treat the estimation of the density function as the main 
usage of the published data. Hence, we qualify the mechanism's utility by the 
distance between the two estimated density functions: one that is derived from 
the original dataset, and the other that is derived from the mechanism's output. 

To facilitate comparison, we need an algorithm to estimate the density func- 
tion from the original pointset D. There are many ways to estimate the density 
function, and we adopt the following method: Let B be the set of distinct-points 
in D, and let us consider the Voronoi diagram of B. The cells in the Voronoi 
diagram are taken to be the bins of a histogram, from which an estimate of the 
density function is obtained. Note that the bins generated have variable sizes, 
and thus the above process can be treated as a form of "variable-bandwidth" 
kernel density estimation[22 , where the kernels are step functions with different 
shapes. 

Similarly we need to estimate the density function when given the output 
from our mechanism. Since isotonic regression is performed on the space-filling- 
curve, we adopt a variant of the above estimation with the Voronoi diagram 
computed in the transformed one-dimensional space. In other words, given the 
reconstructed D of multidimensional points, let B be the set of distinct- values in 
T(D) where T is the locality preserving map. Next, determine the Voronoi dia- 
gram of B, which comprises of a sequence of intervals. Such sequence of intervals 



form the bins of a histogram, from which an estimate of the density function is 
obtained. 

Experimental results: Fig. |18| |19| and [20] show the estimated density function 
from the Twitter's location dataset, the density functions reconstructed by noisy 
equi-width histogram and the dataset by our mechanism. For comparisons, 1% 
of the original points are plotted on top of the two reconstructed density func- 
tions. For the original dataset and the reconstructed dataset by our mechanism, 
we quantize the location domain into 1024 x 1024 units, and compute the es- 
timated density function using the aforementioned method. For the equi-width 
histogram, each bin is of 25 x 25 units, there there are in total 1681 bins. Fig. 
[21] and [22] show the details of the two reconstructed density functions at the 
region [420,560] x [720,840]. Observe that in the density function produced by 
our mechanism has "variable-sized" cells and thus is able to adaptively capture 
the fine details with small amount of noise. 
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Fig. 18. Density function estimated from the original dataset, with darker area 
representing larger value. 



The statistical difference, measured with ^i-norm and ^2-norm, between the 
two estimated density functions derived from the original and the mechanism's 
output are shown in Table [2] We remark that it is not easy to determine the 



optimal bin size for the equi-width histogram prior to publishing. Figure 23 



shows that the optimal bin size differs significantly for three different datasets. 



Table 2. The statistic difference between the estimated density function. 
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Fig. 19. Density function of dataset estimated using our mechanism with e = 3, 
with 1% of the original points (randomly selected) plotted on top. 
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Fig. 20. Density function estimated using equi-width histogram mechanism with 
e = 3, with 1% of the original points (randomly selected) plotted on top. 



6.2 Range Query 

Many applications are interested to ask the total number of points within the 
query range in a dataset. It is desired to publish a noisy pointset meeting the pri- 
vacy requirement, and yet able to provide accurate answers to the range queries. 
Publishing an equi-width histogram would not attain high accuracy if the size 
of the query ranges varies drastically. Intuitively, wavelet-based techniques are 
natural solutions to address such multi-scales queries. Xiao et al. [53] proposed a 
mechanism of adding Laplace noise to the coefficient of a wavelet transformation 
of an equi-width histogram. The noisy wavelet coefficients are then published, 
from which range queries can be answered. Essentially, what being published is 
a series of equi-width histograms with different widths (scales). Note that there 
are quite a number of parameters to be determined prior to publishing, including 
the widths at varies scales and the amounts of privacy budget they consumed. 

The answer to range queries can also be inferred from the output of our 
mechanism. Given a range, we can estimate the number of points within the 




Fig. 22. A zoom-in view of Fig. [20} 
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Fig. 23. The errors versus the bin size for different datasets. Each value on the 
graph is the average over 100 sample runs. 



range from the estimated density function (as described in Section 6.1 ) by accu- 
mulating the probability over the query region, and then multiply by the total 
number of points. 

We compare the wavelet-based mechanism, our mechanism and the equi- 
width histogram mechanism on the Twitter location dataset. For each range 
query, the absolute difference between the the true answer and the answer derived 
from the mechanism's output is taken as the error. We only consider square range 
queries in our experiments. For each query size y, 1,000 randomly selected square 
ranges with width y are taken as the queries, and the average error is shown in 
Fig. [24} 

In this experiment, we use Haar wavelet, and perform wavelet transform on 
the equi-width histogram with 512 x 512 bins. After that, appropriate noise 
is added to ensure e-differential privacy. To incorporate the knowledge of the 
database's size n, the DC component of the wavelet transform is set to be exactly 
n. Under this setting, the best group size for our mechanism is 51. 

Observe that as all mechanisms know the exact value of n, the accuracy 
improve when the range of the query covers more than half of the dataset. 
As expected, the wavelet-base method outperforms the equi-width histogram 
mechanism in larger size range queries, but performs badly for small range due 
to the accumulation of noise. Surprisingly, our mechanism outperforms the equi- 
width histogram method for small range queries, and outperforms the wavelet 
based method for all sizes. This is possibly due to the fact that the locations of 
our queries are uniformly randomly chosen over a continuous domain, and thus, 
it is very likely that the query boundaries do not match the bins, leading to large 
error. 



6.3 Median 

Finding the median accurately in a differential private manner is challenging due 
to the high "global sensitivity" : there are two datasets that differ by one element 
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Fig. 24. The average range query error over 1000 random square range queries 
for each query size. The blue dash-dot line is the error of the equi-width his- 
togram mechanism, the purple solid line is of the wavelet method, the red dashed 
line is our mechanism with group size 20 and the black dotted line is with group 
size 51. 



but having a completely different median. Nevertheless, for many instances, their 
"local sensitivity" are small. Nissim et al. [H] showed that in general, by adding 
noise proportional to the "smooth sensitivity" of the database instance, instead 
of the global sensitivity, can also ensure differential privacy. They also gave an 
Oln 2 ) algorithm that find the smooth sensitivity w.r.t. median. 



Our mechanism outputs the sorted sequence differential privately, it nat- 
urally gives the median. Compare to the smooth sensitivity-based mechanism, 
our mechanism can be efficiently carried out in 0(n) time with an sorted dataset. 



We conduct experiments on datasets of size 129 to compare the accuracy 
of both mechanisms. Due to the quadratic running time in determining smooth 
sensitivity, we are unable to further investigate larger datasets. The experiments 
are conducted for different local sensitivity and different e values. To construct 
a dataset with different local sensitivity, 66 random numbers are generated with 
exponential distribution and then scaled to the unit interval. The dataset con- 
tains the 66 random numbers and 63 ones. Figure 25 shows the average noise 
level of both mechanisms on different local sensitivity, and Figure [26] shows the 
noise level with different e on a dataset that has a local sensitivity of 0.3. 



Observe that when the local sensitivity of the median is high, our mecha- 
nism tends to provide a better result. In addition, our mechanism performs well 
under higher requirement of security: when the e is smaller, the accuracy of our 
mechanism decreases slower than the smooth sensitivity-based method. 
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Fig. 25. The error of median output by the two mechanisms versus different 
local sensitivities. The blue dots are the error incurred by our mechanism and the 
black circles are the error incurred by the smooth sensitivity-based mechanism. 
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Fig. 26. The error of median versus different e. The blue dashed line and black 
solid line are the error incurred by our mechanism and smooth sensitivity-based 
mechanism respectively. 

7 Extension and Future Works 
7.1 Hybrid Method 

The proposed mechanism can be viewed as the publishing of a "fixed-depth" 
histogram, where the number of elements in the histogram bins is fixed prior to 
publishing, whereas the mechanism outputs noisy bin's boundary. On the other 
hand, mechanisms based on frequency counts can be viewed as "fixed-width" 
histogram, where the boundary of the bins are fixed prior to publishing, whereas 
the mechanisms output noisy counts of elements in the bins. The fixed-depth and 
fixed-width histogram could complement each other, by alternatively publishing 
one after another. Here are two possibilities: 



Fixed-width-then-fixed-depth Let us take the Twitter location dataset shown 
in Figure [T] as an example. Observe that large portion of the region is sparse. If 
the sparse region can be omitted, the sensitivity of sorting would be significantly 
reduced. This could be achieved by (l)first publishing an coarse equi-width his- 
togram with large width. (2) Next, for each bin, use the deterministic padding 
algorithm (Section [3] Remark [2| to extract n points, where n is the noisy count 
output by the equi-width histogram. (3) Finally, publish the extracted points 
using our fixed-depth mechanism. Note that the sensitivity for the fixed-depth 
mechanism is the width (or area) of the bin, which could be significantly smaller 
than the width (or area) of the whole domain. 

Fixed-depth-then-fixed-width The unique solution of isotonic regression is 
a piecewise constant function. The steps in the solution lead to artifacts of clus- 
tered data. It is interesting to investigate whether a subsequent fixed-width his- 
togram could "break" the steps. 

7.2 Other Metric 

It is interesting to investigate whether the proposed techniques can be applied 
to multidimensional data other than spatial data, for instance, tuples with at- 
tributes of age and gender. 

8 Related Work 

There are extensive works on privacy-preserving data publishing. The recent 
survey by Fung et al. [5] gives a comprehensive overview on various notions, for 
example, fc-anonymity |24) . ^-diversity [TS], and differential privacy [5]. 

Hay et al. [TT] proposed exploiting redundancies in the published data to boost 
accuracy, with supporting examples. One of the examples employs isotonic re- 
gression but in a way different from our mechanism. They consider publishing 
unattributed histogram, which is the (unordered) multiset of the frequencies of a 
histogram. As the frequencies are unattributed (i.e. order of appearance is irrele- 
vant) , Hay et al. proposed publishing the sorted frequencies and later employing 
isotonic regression to improve accuracy. In contrast, our mechanism publishes 
the whole database. It is no doubt that median is an important statistic. Find- 
ing median in a differentially private way is not easy due to the large global 
sensitivity. Nissim et al. |19j introduced the notion of smooth sensitivity and pro- 
posed an 0{n 2 ) algorithm that computes the smooth sensitivity of an instance 
w.r.t. median. Median has also been used in the construction of other differential 
private mechanisms, for e.g. dataset learning [3] and spatial decompositions |4j. 

9 Conclusion and discussions 

Our mechanism is very simple from the publisher's point of view. The publisher 
just has to sort the points, group consecutive values, add Laplace noise and 



publish the noisy data. There is also minimal tuning to be carried out by the 
publisher. The main design decision is the choice of the group size k, which can 
be determined using our proposed noise models, and the locality preserving map 
which the classic Hilbert curve is suffice in attaining high accuracy. Through 
empirical studies, we have shown that the published raw data contain rich infor- 
mation for the public to harvest, and provide high accuracy even for usages like 
median-finding, and range-searching that our mechanism is not initially designed 
for. Such flexibility is desired for the need of "publish data, not the data mining 
result' as deliberated by Fung et al. [5]. 
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